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Abstract 

Detecting all the strings that occur in a text more frequently or less 
frequently than expected according to an IID or a Markov model is a basic 
problem in string mining, yet current algorithms are based on data struc¬ 
tures that are either space-inefficient or incur large slowdowns, and current 
implementations cannot scale to genomes or metagenomes in practice. In 
this paper we engineer an algorithm based on the suffix tree of a string to 
use just a small data structure built on the Burrows-Wheeler transform, 
and a stack of 0(a 2 log 2 n) bits, where n is the length of the string and 
a is the size of the alphabet. The size of the stack is o(n) except for 
very large values of a. We further improve the algorithm by removing its 
time dependency on a, by reporting only a subset of the maximal repeats 
and of the minimal rare words of the string, and by detecting and scoring 
candidate under-represented strings that do not occur in the string. Our 
algorithms are practical and work directly on the BWT, thus they can be 
immediately applied to a number of existing datasets that are available 
in this form, returning this string mining problem to a manageable scale. 


1 Introduction 

Detecting all the patterns of a string whose number of occurrences matches some 
notion of statistical surprise is a fundamental requirement of the post-genonre 
era, in which textual datasets grow faster than the ability to understand them, 
and in which over- and under-representation with respect to a statistical model 
is often an indicator of structure or function. The sheer volume of the available 
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datasets makes even simple models of patterns and simple measures of statistical 
surprise useful in practice, if their detection scales to extremely long strings in 
reasonable time and space. In this paper we focus on the simplest possible 
model of a pattern - a string W, of any length, that occurs without mismatches 
fr(W) times in a text T of length n - and we consider measures of statistical 
surprise that score W according to the expected number E [friW)] and to the 
variance V[/t(W")] of the number of its occurrences in a random text of length 
|T| (see e.g. [T] and references therein). We assume that the random source 
is a given Markov chain, and for concreteness we set its order to zero, since 
this simple case already captures the computational structure of the problem 
P]. Moreover, we focus on computing V[/t(W)], since computing expectations 
with respect to a Markov chain of order zero is trivial (see e.g. [5]). 

V[/t(W)] enjoys the remarkable property that its computation can be car¬ 
ried out by iterating over all the proper borders of W, i.e. over all the nonempty 
substrings of W shorter than W that are at the same time prefix and suffix of 
W : see [3] for a detailed derivation. To make the paper self-contained, here we 
just recall that V[/t(W’)] can be computed in constant time from the functions 
(f){W) and 7 (W) defined below: 

<j>(W) = Y (n-2|W1 +6 + 1) -n(W[b..\W\ - 1]) 

beB(W) 

7 (W) = Y n(W[b..\W\ - 1]) 

6eB(W) 

where strings are indexed from zero, -k{W) = IlS ” 1 P[Wj], bE* is the zth 
character of string W, P[c] is the probability of character c according to the 
given zero-order Markov chain, and B(W) is the set of all border lengths of W. 
Removing the components of V[/r(W)] that depend on borders can cause large 
relative errors in practice [ 2 ], so we focus on the exact computation of V[/t’(W / )]. 
It is well known that borders have a recursive structure, in the sense that the 
set of borders of W consists of the longest border V of W, and of all the borders 
of V. This observation enables one to map the computation of V[/r(bP)] for a 
given W onto the Morris-Pratt algorithm m , thus achieving time 0 (|W|) in 
the worst case [3]. Specifically: 

<j){W) = S(W)-(^{B)-2(\W\-\B\)'y(B)+n-2\W\ + \B\ + l) 

7 (W) = 5(W)- (1+7(5)) 

where B is the longest border of W, and 5{W) = 7 r(IT[|I 3 |..|IT| — 1]). In prac¬ 
tice, however, we are interested in extracting from a string T all its substrings 
W, of any length , such that a user-specified measure of surprise computed on 
E[f T (W)] and V[/ 7 ’(W / )] is, say, greater than a threshold. Even though com¬ 
puting Y[fr(W)] takes 0(|W|) time for a given W, enumerating and scoring in 
this way all substrings of a text T of length n takes 0(n 2 ) time. 

Luckily, a number of statistical scores z(W) enjoy the additional property 
that z(XWY) > z(W) if fr(XWY) = fr(W), where A' and Y are strings 
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|T]. Consider then a set A of substrings of T such that all substrings in the 
set have the same number of occurrences, and consider the partial order A on 
A such that V A W iff W = XVY for (possibly empty) strings X and Y. If 
we display to the user just the maximal elements of A with respect to A, we 
guarantee that every over-represented string W € A that we do not output is 
a substring of a string XWY £ A in the output which has at least the same 
score. Symmetrically, if we display just the minimal elements of A with respect 
to A, we guarantee that every under-represented string XWY £ A that we do 
not output is a superstring of a string W £ A in the output which has at most 
the same score. A possible choice for A is the set of all substrings that start at 
exactly the same positions in T: this class has a unique maximal element, which 
corresponds to a node of the suffix tree of T, and a unique minimal element, 
which corresponds to the right extension by a single character of a node of the 
suffix tree of T. Since the number of all such classes is 0(\T\), and since all such 
classes are connected to one another by a trie, known as the suffix-link tree , it 
is possible to devise an algorithm that computes V[/t(W’)] for all the maximal 
and minimal elements of all such classes, in a total amount of time that grows 
linearly in |T| |2|. 

In this paper we engineer the algorithm described in [2] to use as its sub¬ 
strate the Burrows-Wheeler transform of T, rather than space-inefficient data 
structures like the (truncated) suffix tree of T, or space-efficient simulations of 
the suffix tree with 0(log e n) slowdown, like compressed suffix trees (see e.g. 
m and references therein). We also observe that the time complexity of the 
algorithm described in [2] depends on the cardinality of the alphabet, and we 
remove this dependency. Moreover, we adapt the algorithm to work on the 
smallest possible set of equivalence classes, thus reducing time and output size 
in practice. Assuming an alphabet of size a £ o(y/n/ log?z), we can thus perform 
all the computations described in [2] in 0(n) time and in o(n + Xffin) bits of 
space, given the BWT of T and few additional data structures [7], where A is 
the length of a longest repeat of T. For statistical reasons, the maximum length 
of a string to be reported is often 0( log CT n), thus our algorithm uses effectively 
o(n ) bits of space in addition to the input. Concatenating this setup to the 
BWT construction algorithm described in [5], we can discover all the over- and 
under-represented substrings of T, directly from T itself in randomized 0{n) 
time and in 0(n log a) bits of space in addition to T itself. Finally, we extend 
the algorithm in [2] to consider potentially under-represented strings that do 
not occur in T, thus providing for the first time a way to score and rank the 
minimal absent words of T. 


2 Preliminaries 

2.1 Strings 

Let £ = [l..cr] be an integer alphabet, let # = 0 be a separator not in £, let 
T = [l..cr] n_1 # be a string, and let e be the empty string. For reasons that 
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will become clear in Section 2.2 we assume a £ o(y / n/log n) throughout the 


paper. We denote by fr(W) the number of (possibly overlapping) occurrences 
of a string W in the circular version of T. A repeat W is a string that satisfies 
fr(W) > 1. We denote by E^(W) the set of characters {a £ [0..u] : fr{aW) > 
0} and by £^(W) the set of characters {b £ [0..er] : fr(Wb) > 0}. A repeat W 
is right-maximal (respectively, left-maximal) iff |£^(W)| > 1 (respectively, iff 
\E e T (W)\ > 1). It is well known that T can have at most n — 1 right-maximal 
substrings and at most n — 1 left-maximal substrings. A maximal repeat of T 
is a repeat that is both left- and right-maximal. Clearly a maximal repeat W 
of T satisfies Jt{oW) < fr(W) and fr(Wb) < fr(W) for any characters a and 
b in £. A repeat is supermaximal if it is not a proper substring of any other 
repeat. A minimal rare word of T is a string W that satisfies fr(W) < fr(V) 
for every proper substring V oiW. Clearly W must have the form aXb, where 
a and b are characters and X is a maximal repeat of T. If fr(W) > 0, then aX 
is a right-maximal substring of T and Xb is a left-maximal substring of T. If 
fr(W) = 0, then aX (respectively, Xb) must occur in T, but it is not necessarily 
right-maximal (respectively, left-maximal). A minimal rare word of T that does 
not occur in T is called minimal absent word (see e.g. Ena and references 
therein): the total number of such strings can be 0(crn ) [9]. A minimal rare 
word of T that occurs exactly once in T is called minimal unique substring (see 
e.g. m and references therein). It is clear that the total number of minimal 
rare words of T that occur at least once in T is 0(n). 

String V / £ is a proper border of string W if W = VX and W = YV 
for nonempty strings X and Y. A string W can have zero, one, or multiple 
proper borders: we denote by bord(W) the length of the longest border of 
W. Each border of W is followed by a character when it occurs as a prefix, 
and it is preceded by a character when it occurs as a suffix: we use a\W to 
denote the length of the longest border of W that is preceded by character a 
when it occurs as a suffix, and we use W\a to denote the length of the longest 
border of W that is followed by character a when it occurs as a prefix. Clearly 
both a\W and W\a can be zero. We denote by B(W) the set of lengths of all 
borders of W, by B r (W) the set of pairs {(a,a\W) : a £ a, a\W ^ 0}, and 
by B e (W) the set of pairs {(a, W\a) : a £ a, W\a ^ 0}. It is well known that 
B(W) = \bord(W)} U B(V), where V is the longest border of W: see e.g. ITTil 
and references therein. In this paper we will also use leftyv to denote an array 
of size |£y(W)|, indexed by the characters in Ydj,{W) in lexicographic order, 
such that leftw[c] = W|c. Similarly, we will use rightly to denote an array of 
size |£^(W)|, indexed by the characters in £y(W) in lexicographic order, such 
that rightly [c] = c|W. Set B(W) determines all possible ways in which W can 
overlap with itself: specifically, the maximum possible number of occurrences 
of W in a string of length n is f*(W,n ) = \{n — \W\ + 1 )/period{Wf\, where 
period(W) = \W\ — bord(W). When f T {W) needs to be compared to 
where \T'\ ^ \T\, it is customary to divide fr(W) by f*(W. , |T|). It is easy to 
see that the longest border of a random string of length n generated by an IID 
source is expected to tend to a constant as n tends to infinity. 

For reasons of space we assume the reader to be familiar with the notion of 
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suffix tree STy of a string T, which we do not define here. We denote by £(v) 
the string label of a node v in a suffix tree. It is well known that a substring 
W of T is right-maximal iff W = £{v) for some internal node v of STx- We 
assume the reader to be familiar with the notion of suffix link connecting a 
node v with £(v) = aW for some a £ [0..<t] to a node w with £(w) = W: we say 
that w = suffixLink(u) in this case. Here we just recall that suffix links and 
internal nodes of STy form a tree, called the suffix-link tree of T and denoted 
by SLTt, and that inverting the direction of all suffix links yields the so-called 
explicit Weiner links. Given an internal node v and a symbol a £ [0..<r], it might 
happen that string ai{v) does occur in T, but that it is not right-maximal, i.e. 
it is not the label of any internal node of STjc all such left extensions of internal 
nodes that end in the middle of an edge are called implicit Weiner links. An 
internal node v of STp can have more than one outgoing Weiner link, and all 
such Weiner links have distinct labels: in this case, £{v) is a maximal repeat. 
It is known that the number of suffix links (or, equivalently, of explicit Weiner 
links) is upper-bounded by 2 n— 2, and that the number of implicit Weiner links 
can be upper-bounded by 2n — 2 as well. 

If V is a nonempty proper border of W, then E ^ (IF) C E^ (F) and E ^ (IF) — 
Ti’ T (y). Thus, if W is right-maximal (respectively, left-maximal) then V is 
right-maximal (respectively, left-maximal); if W is a maximal repeat, then V is 
a maximal repeat; and if W = aX where a £ E and X is a maximal repeat, 
then V = aY where Y is a maximal repeal]]] 

2.2 Enumerating maximal repeats and minimal rare words 

For reasons of space we assume the reader to be familiar with the notion and 
uses of the Burrows-Wheeler transform of T, including the C array, the rank 
function, and backward searching. In this paper we use BWTr to denote the 
BWT of T, we use range(IF) = [sp(IF)..ep(IF)] to denote the lexicographic 
interval of a string IF in a BWT that is implicit from the context, and we use E ij 
to denote the set of distinct characters that occur inside interval [i..j] of a string 
that is implicit from the context. We also denote by rangeDistinct(i, j) the 
function that returns the set of tuples {(c, rank(c,p c ), rank(c, q c )) : c £ E ij}, 
in any order , where p c and q c are the first and the last occurrence of character 
c inside interval [i..j\, respectively. Here we focus on a specific application of 
BWTy: enumerating all the right-maximal substrings of T, or equivalently all 
the internal nodes of STy. In particular, we use the algorithm described in [3] 
(Section 4.1), which we sketch here for completeness. 

Given a substring IF of T, let bi < b 2 < • • ■ < bk be the sorted sequence of 
all the distinct characters in E^(IF), and let cii,a 2 ,..., be the sequence of 
all the characters in E^(IF), not necessarily sorted. Assume that we represent 

1 Thus, maximal repeats connected by longest border relationships form a tree rooted at the 
empty string: the path from the root to a maximal repeat lists all its borders, and the internal 
nodes of this tree cannot be supermaximal repeats. Similarly, longest border relationships and 
strings aW (respectively, Wo.) where W is a maximal repeat, form a tree rooted at the empty 
string. 


5 



a substring W of T as a pair repr(W) = (chars[l..fc], f irst[l..fc + 1]), where 
chars[z] = bi, range(IE5.;) = [first[i]..first[z + 1] — 1] for i £ [l..fc], and func¬ 
tion range() refers to BWTx- Note that range(W) = [f irst[l].„f irst[fe+l] —1], 
since it coincides with the concatenation of the intervals of the right exten¬ 
sions of W in lexicographic order. If W is not right-maximal, array chars in 
repr(W) has length one. Given a data structure that supports rangeDistinct 
queries on BWTt, and given the C array of T, there is an algorithm that 
converts repr(VF) into the sequence ai,... ,au and into the corresponding se¬ 
quence repr(aiW), ..., repr(ahW), in 0(de) time and 0(a 2 logn) bits of space 
in addition to the input and the output [5], where d is the time taken by the 
rangeDistinct operation per element in its output, and e is the number of 
distinct strings CLiWbj that occur in the circular version of T, where i £ [1 ..h\ 
and j £ [l..fc]. We encapsulate this algorithm into a function that we call 
extendLeft. 

If a,iW is right-maximal, i.e. if array chars in repr(a,;I / b) has length greater 
than one, we push pair (repr(aiW), | W| + l) onto a stack S. In the next iteration 
we pop the representation of a string from the stack and we repeat the process, 
until the stack becomes empty. This process is equivalent to following all the 
explicit Weiner links from the node v of STy with £(v) = W, not necessarily 
in lexicographic order. Thus, running the algorithm from a stack initialized 
with repr(e) is equivalent to performing a depth-first preorder traversal of the 
suffix-link tree of T (but with an arbitrary exploration order on the children of 
each node), which guarantees to enumerate all the right-maximal substrings of 
T. Every operation performed by the algorithm can be charged to a distinct 
node or Weiner link of ST^, thus the algorithm runs in 0(nd) time. We keep the 
depth of the stack to 0(log n) rather than to 0(n) by using the folklore trick of 
pushing at every iteration the pair (repr(aiW), |ajW|) with largest range(aiW) 
first (see e.g. na). Every suffix-link tree level in the stack contains at most 
(j pairs, and each pair takes at most a log n bits of space, thus the total space 
used by the stack is 0(a 2 log 2 n) bits. The following theorem follows from our 
assumption that a £ o(^/n/ logn): 

Theorem 1 ([5])- Let T £ [l..a] n ~ 1 # be a string. Given a data structure 
that supports rangeDistinct queries on BWTt, we can enumerate all the 
right-maximal substrings W ofT, and for each of them we can return \W\, 
repr(W), the sequence ai, 02 ,..., ah of all characters in T,^ t {W) (not necessar¬ 
ily sorted), and the sequence repr(aiW),..., repr(aft W), in 0(nd ) time and in 
0(cr 2 log 2 n) = o(n) bits of space in addition to the input and the output, where 
d is the time taken by the rangeDistinct operation per element in its output. 

Theorem [l] does not specify the order in which the right-maximal substrings 
must be enumerated, nor the order in which the left extensions of a right- 
maximal substring must be returned. The algorithm we just described can be 
adapted to return all the maximal repeats of T, within the same bounds, by 
outputting a right-maximal string W iff |rangeDistinct(sp(W), ep(W))| > 1. 
Computing the minimal rare words that occur in T is also easy: 
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Lemma 1. LetT £ [1 ..a) n ~ 1 ff be a string. Given a data structure that supports 
rangeDistinct queries on BWTjq we can enumerate all the minimal rare words 
W of T that occur at least once in T , and for each of them we can return \ W\ and 
range(W / ), in 0(nd) time and in 0(a 2 log 2 n) = o{n ) bits of space in addition 
to the input and the output, where d is the time taken by the rangeDistinct 
operation per element in its output. 

Proof. We use a technique similar to the one described in [6]. Specifically, we 
adapt Theorem [l] to iterate over all maximal repeats of T , and we allocate a 
temporary array freq[0..er], indexed by all characters in the alphabet. After 
having enumerated a maximal repeat W, we scan repr(VF), we compute the 
number of occurrences of every right extension Wb of W using array first, 
and we write friWb) in freq[h|. Then, for every i £ [l../i], we check whether 
repr (aiW) contains more than one character: if this is the case, then fr^W) > 
friaiWh) for every b £ [0..<r]. Thus, we scan repr(aiW) and for every b in its 
array chars we check whether friaiWb) < fxiWb), by accessing freq[6]. If 
this is the case, then aiWb is a minimal rare word, and its interval in BWTx 
can be derived in constant time from the array first of repr(aiW). At the 
end of this process, we reset array f req to its initial state by scanning repr(IT) 
again. □ 

□ 

Minimal rare words that do not occur in T can be enumerated using a slight 
variation of Lemma [l] as described in [Bj: 

Lemma 2 (0). Let T £ [l-.cr]" be a string. Given a data structure that 
supports rangeDistinct queries on B\NTt, we can enumerate all the minimal 
rare words aWb of T that do not occur in T, where a and b are characters and 
W £ [1..<t]*, and for each of them we can return a, b, \W\, and range(W), in 
0(nd + occ) time and in 0(cr 2 log 2 n) = o{n) bits of space in addition to the 
input and the output, where d is the time taken by the rangeDistinct operation 
per element in its output, and occ is the output size. 

For reasons of space, we assume throughout the paper that d is the time per 
element in the output of a rangeDistinct data structure that is implicit from 
the context. 


3 Computing the border of all right-maximal 
substrings 

As mentioned in the introduction, computing the exact variance of the number 
of occurrences of a string W in a random text of length n can be mapped to 
the computation of the longest border of all suffixes of W [3], and thus takes 
0(|W|) time using the Morris-Pratt algorithm [17] , To compute the longest 
border of all right-maximal substrings of a text T , as well as of all substrings 
Wb of T such that b £ [0..ct] and W is right-maximal, in overall linear time on 
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|T|, we need the following algorithm described in [2], which we sketch here for 
completeness: 

Theorem 2 ([2j). Let T £ [l..a] n be a string. There is an algorithm that com¬ 
putes bord(W) for all right-maximal substrings W ofT, and for all substrings 
W = Vb of T where b £ [l..cr] and V is right-maximal, in 0(n) words of space. 
The running time of this algorithm is linear in n and depends on a. 


Proof sketch. We build the suffix tree ST^ of T, and we assume that every 
node v of ST r stores sets £^(£(u)) and Ej(f(r)) as lexicographically sorted 
lists. We perform a depth-first traversal of the suffix-link tree of T, and we 
store in each node v with label £(v) = W the arrays right^ and left^y de- 

right^/ (respectively, leftyy) is indexed by 


scribed in Section 


recall that 


2.1 

all characters a £ £ jfW) (respectively, b £ YJffW)), and it stores value a\W at 
position righted] (respectively, value W\b at position leftvy[&]). Clearly, 
if right^a] > 0, then bord(aW) = right w ,[a] + 1. If right^a] = 0, 
then bord(aW) is either one (if a matches the last character of W ) or zero. 
Once we know bord(aW), we compute array right aM/ by exploiting the iden¬ 
tity B{ aW) = {bord(aW)} U 15(C), where V is the longest border of aW (see 
Section 2.1): specifically, for every character c £ "E e T (aW), we know that c be¬ 
longs also to £^(C), thus we set right aM/ [c] = right r [c] if c ^ d, and we set 
right alv [d] = bord(aW), where d is the character that precedes the suffix of 
aW of length bord(aW). Since we know that every character c £ also 

belongs to YTffV), we can compute array left a jy from lefty in the same way. 

Every cell of every array right^ can be charged to a (possibly implicit) 
Weiner link of STr, and every cell of every array left^y can be charged to an 
edge of ST t , thus the algorithm uses 0(n) words of space overall in addition 
to STr- However, copying right aW [c] from right v [c], where V is the longest 
border of aW (respectively, left a y/[c] from lefty[cj) requires retrieving c from 
the list of left extensions (respectively, right extensions) of C, or merging such 
list with the corresponding list of aW, which introduces a dependency on er. □ 


A dependency on er can be problematic in data mining applications, where 
the alphabet could be the result of a dense discretization of a continuous range 
(see e.g. Haul] and references therein). To make Theorem [2] independent of 
er, we start from generalizing the algorithm described in |18| to a trie, counting 
return arcs that do not point to the root: 


Lemma 3. Let T = ( V,E,a ) be a trie on alphabet [l..cr], let £(v) = i(e.k) ■ 

£(ek-i) .^( e i) be the label of a node v £ V that is reachable from the root 

with path ei, e^, ■ ■ ■, e*,, where £ E for all i £ [l..fc], and let a\v = w £ V : 
£(w) = a\£{v). The set of return arcs E' = {(u, a|u) : v £ V, a £ [l..er],a|u 7 ^ 0} 
satisfies \E'\ < 2\V\. 

Proof. We say that a return arc (v,a\v) £ E' is of type 1 if there is an edge 
e = (v,w) £ E with £(e) = a , and we say that it is of type 2 otherwise. The 
total number of type-1 return arcs is at most \V\, thus we focus on type-2 return 
arcs. Let A = £(v) and let B = A[l..a\£{v)\. We charge a type-2 return arc 





(v,a\v) to the vertex w that satisfies A = B ■ £(w). Assume that two distinct 
type-2 return arcs (mi, i>i), (m 2 , V 2 ) are charged to the same vertex w. Clearly it 
must be u\ 7 ^ M 2 . If u± and 112 do not lie on the same path of T, then w must be 
the lowest common ancestor of U\ and M 2 , or one of its ancestors (excluding the 
root). Let W be the label of the path from w to the lowest common ancestor of 
iti and M 2 : clearly £(ui) = X\ ■ a ■ W ■ £{w) and £( 112 ) = X 2 ■ b ■ W ■ £(w), where a 
and b are distinct characters and A'i and X 2 are strings of the same length, but 
at the same time it must be Xi ■ a ■ W = X 2 ■ b ■ W, a contradiction. Assume 
thus that Mi and M 2 lie on the same path of T: without loss of generality, let 
Mi be an ancestor of M 2 . Further assume that there is an edge e = (ui,v) £ E 
with £(e) = a. Then it must be that £{u\) = Yi • £(w) = X\ ■ b ■ Y\ and 
£( 112 ) = I 2 • a ■ Y\ ■ £(w) = X 2 ■ Y 2 ■ a ■ Y\ and a / 6 , but at the same time it must 
be that a ■ Y\ = b ■ Yi, a contradiction. □ 

□ 

Lemma 4. Let T £ [l-.er]" be a string. There is an algorithm that computes 
bord(W) for all right-maximal substrings W of T in 0(n) time and words of 
space. 

Proof. We proceed as in Theorem [2j but at every node v of STy with £(v) = 
W, we store set B r {W), sorted in lexicographic order, rather than right^. 
Recall that B r (W) is the set of pairs {(a,a|IY) : a £ [l..cr],a|IY 7 ^ 0}, i.e. 
a representation of the return arcs of Lemma [3] At every node v we merge 
B r (W) with the lexicographically sorted list of characters in E^(1Y), setting 
bord(aW) = a|!Y (which might be zero) for every left extension aW. Once 
we know bord(aW), we build B r (aW) by copying the entire B r (V), where V is 
the longest border of aW, and by updating or inserting pair (d,d\aW) using a 
linear scan of B r (V ), where d is the character that precedes the suffix of aW of 
length bord(aW). This process touches every return arc of Lemmata constant 
number of times. □ 

D 

Lemma[4]can be clearly applied to any trie T of size n on an alphabet of size 
er, in which every node stores its children in lexicographic order: the space used 
by such algorithm in addition to the trie is 0(min{n, Act}), where A is the length 
of a longest path in T. However, Lemma [3] does not generalize to radix trees, 
thus we cannot use it to bound the construction time of left arrays in Theorem 
[2j To achieve this, it suffices to replace left arrays with suitable stacks: 

Lemma 5. Let T £ [l..er]" be a string. There is an algorithm that computes 
bord(W) for all right-maximal substrings W ofT, and for all substrings W = Vb 
ofT where b £ [0..er] and V is right-maximal, in 0(n) time and words of space. 

Proof. We run the algorithm in Lemma |4j keeping a stacks Si, S 2 , ■ ■ ■, S a . As¬ 
sume that, when we visit the node v of STy with £{v) = W, the top of stack 
Sb for all b £ E^(Y) stores value W\b (which might be zero). Assume that, 
in the depth-first traversal of the suffix-link tree of T, we choose to visit the 
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node w with £(w) = aW next: then, we iterate over all characters b £ S^(aW), 
we compute aW\b by accessing position bord(aW) of stack S b , and we push 
aW\b on S b . This works since, if aW can be extended to the right by character 
b, then every suffix of aW can be extended to the right with character b as 
well. This process takes overall 0(n) time, and the size of all stacks Si,... , S a 
is 0(min{n, Act}), where A is the length of a longest repeat of T, since every 
element in every stack can be charged to an edge of STt- □ 

□ 

The information stored by Lemma[5]is enough to compute in 0(n ) time and 
space the longest border of all minimal rare words that occur at least once in T. 
Adapting Lemma [5] to compute the longest border of all minimal absent words 
of T is also easy: 

Lemma 6. Let T £ [1..ct]" be a string. There is an algorithm that computes 
bord(W ) for all minimal absent words W of T in 0(7%+ occ) time and words 
of space, where occ is the size of the output. 

Proof. We proceed as in Lemma [5j but when we visit the node v of ST^ with 
£(v) = aW for some a £ [1..ct], we assume that the top of stack Sb for all 
b £ Hj,(W) stores value aW\b (which might be zero). Assume that, in the 
depth-first traversal of the suffix-link tree of T, we choose to visit the node 
w with £(w) = caW next: then, we iterate over all characters b £ E^(aW), 
we compute caW\b by accessing position bord(caW) of stack Sb, and we push 
caW\b on Sb- This works since, if aW can be extended to the right by character 
b, then the proper suffix of the longest border of caW can be extended to the 
right with character b as well. Every element in every stack can be charged 
either to an edge of ST t or to a minimal absent word of T, thus the size of all 
stacks is 0(min{n + occ, Act}), where A is the length of a longest repeat of T. 


4 Detecting unusual words in small space 

As mentioned in the introduction, in order to compute the exact variance of 
a substring W of a string T, we just need to compute functions <j>(W) and 
7 (W). Specifically, we just need to compute such functions on all right-maximal 
substrings of T, and on all substrings Wa oi T such that W is right-maximal 
P]. Since <j>(W) and 7 fW) can be computed from 4>(V) and 7 (E), where V is 
the longest border of W, it is easy to see that we can adapt Lemma[5]as follows. 
When we visit the node v of STt with £(v) = W, we store 4>{W) and 7 (W), and 
the top of stack Sb for all b £ Yi r T (W) stores the following values (which might 
be zero) in addition to W\b: 

F b (\W\) = D b (\W\)-(F b (W\b)-2(\W\-W\b)-G h (W\b) + \T\-2\W\+W\b) 
G b (\W\) = D b (\W\) ■ (1 + G b (W\b)) 
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where Db(\W\) = n(W[W\b + l..|W| — 1]) • P[6]. Note that such values 
are computed recursively. We derive Tt(W\W\b + l..|W| — 1]) by keeping an 
additional stack that stores n(V) for every suffix V of W. The entire process 
can be implemented using just a data structure that supports rangeDistinct 
queries on BWT^, by implementing Lemma [5] on top of the iterator described 
in Theorem [T] 

Theorem 3. Let T £ [l-.o - ]" -1 ^ be a string. Given a data structure that 
supports rangeDistinct queries on BWTt, we can compute and 7 (W) 

for all right-maximal substrings W of T, and for all substrings W = Vb of T 
such that b £ [l..er] and V is right-maximal, in 0{nd ) time and in o{n + Xy/n) 
bits of space in addition to the input and the output, where d is the time taken 
by the rangeDistinct operation per element in its output and X is the length 
of a longest repeat of T. 

Proof. Recall that the iterator of Theorem [T] returns, for every right-maximal 
substring W of T, the set of all its right extensions in lexicographic order, but 
the set of all its left extensions in arbitrary order. To implement Lemma [5] 
in this case, we just need a temporary array buffer[l..<r] of crlogn bits that 
is initialized to all zeros at the beginning of the traversal. When the itera¬ 
tor visits substring W, we scan B r (W) and we set buffer [a] = a\W for all 
(a, a|W) £ B r (W). Then, for every left extension a* of W provided by the 
iterator, we compute bord(a.iW) by accessing buffer [a,], and we proceed as 
in Lemma [5] Once we have finished processing substring W, we reset buffer 
to its previous state by setting buffer [a] = 0 for all (a, a|W) £ B r (W). The 
claimed space complexity comes from Theorem [T] and from our assumption that 
cJ £ o(y/n/ log n). □ 

a 

For statistical reasons only substrings of length Ojlogg. n) are candidates for 
being over- or under-represented [I], so the space complexity of Theorem [ 3 ] is 
effectively o(n), every surprising substring can be encoded in a constant number 
of machine words, and thus it can be printed in constant timc[^] The technique 
described in Theorem [3] can be used to apply Lemma [4] to tries whose nodes 
do not store the list of their children in lexicographic order. Moreover, it is 
easy to adapt Theorem [3] to output all candidate over- and under-represented 
substrings whose statistical score matches a user-specified criterion, by keeping 
an additional stack of characters of size A log a and by exploiting the fact that 
the iterator in Theorem [T] returns the length of every substring it visits. 

Reducing the number of patterns displayed by a data mining algorithm is 
key for making it useful in practice. According to the monotonicity of the scores 
described in Sectionjl] we can limit the search for over-represented (respectively, 
under-represented) substrings to maximal repeats (respectively, to minimal rare 
words): this observation was already implicit in |T|, and Theorem[3]can be easily 

2 We assume the word RAM model of computation with words of size f2(logn) bits, in 
which all standard operations including multiplication have unit cost. 
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adapted to consider only such candidates. Moreover, in a practical implemen¬ 
tation we can compute and store <fi{W) and 7 (W) just for maximal repeats, 
since the longest border of a maximal repeat is itself a maximal repeat. We 
can also avoid storing numbers F b (\W\), Gf,(\W\) and W\b for all b G T,l r (W) 
on the stacks of Lemma [5j whenever W is a right-maximal substring of T that 
cannot be written as aV for a character a and a maximal repeat V. Within 
the working space budget of Theorem [3j but in time 0(nd + occ), we can also 
compute the border and the statistical scores of all occ minimal absent words 
of T, by adapting Lemma [ 6 ] to work on Theorem [I] such strings are the only 
strings that do not occur in T which could be under-represented in T, however 
they were not reported in previous works mm- The ability to assign a statis¬ 
tical score to minimal absent words could be useful also in other contexts, for 
example in choosing which minimal absent words should be displayed to the 
user, since their total number is O(ncr) in the worst case. 


5 Implementation and experiments 

The algorithms described in this paper are practical: a prototype implementa¬ 
tion that scores all maximal repeats and minimal rare words of a DNA string of 
length 14.8 • 10 6 uses on average just 12 • 10 3 bits for the stack (which can fit in 
the LI cache of current processors), with a few sudden bursts that reach up to 
approximately 6 ■ 10 5 bits (which can fit in the L2 cache of current processors): 
see Figure |T] (left). Since borders are short on average (constant for a random 
string), border information takes a negligible fraction of the stack, which is ap¬ 
proximately equally divided between BWT intervals and quantities related to 
the variance. Since most borders are short, most recursive accesses of the algo¬ 
rithm are directed to the bottom of the stack (Figure [I] right), so the hardware 
can automatically move portions of this region to the LI cache. Moreover, if the 
user knows that the length of the longest border of every substring of the string 
under analysis is upper-bounded by a constant (3, the stack can be made even 
smaller by pushing border and variance information just for string of length at 
most /3. 

Our prototype implementation compares favourably to the previous imple¬ 
mentation of Theorem [ 2 ] described in |4j: given the BWT represented as a 
wavelet tree, our implementation scores all maximal repeats and minimal rare 
substrings in approximately 33 seconds on a 2.50 GHz Intel Xeon E5-2640, while 
the previous implementation takes approximately 57 seconds and a peak of 6 
gigabytes to build a truncated suffix tree and to score only candidate substrings 
of length at most 12, and 2 (respectively, 4) minutes and a peak of 14 gigabytes 
to build a truncated suffix tree and to score only candidate substrings of length 
at most 24 (respectively, 36). Being essentially a tree traversal, our algorithm is 
intrinsically parallel: we are in the process of developing a shared-memory, mul¬ 
tithreaded implementation, and of testing the speedup induced by using more 
than one core. 
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Figure 1: A prototype implementation that scores all maximal repeats and min¬ 
imal rare words of the genome of Sorangium cellulosum (NCL021658 in NCBI). 
Left panel: number of bits taken by the stack in an entire run, sampled every 
two second. Top insert: detail of a typical subinterval without spikes. Bottom 
insert: percent composition of the stack in an entire run. Light gray: informa¬ 
tion to traverse the suffix-link tree using the BWT. White: border information. 
Black: variance information. Right panel: number of recursive read requests per 
string length. Thin line: NCL021658, thick line: a random reshuffle of NCL021658. 
The insert details the interval of lengths [1..21]. The peak around length 12 is 
caused by the retrievals of S(W) generated by substrings of expected length. 
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